1 Project Information

1.1 Programmatic Information

ARPA Order Number: Q$94

Contract Number: HR0011-04-1-0013

Performance Period: February 12, 2004 - May 31, 2006

Project Title: Collective Inference with Learned and Engineered Knowledge

Project  URL:  http:///Mlxs.uiM^ 

Performing  Organization:  University  of  Massachusetts 

Subcontractors:  None 

Contacts: 

Principal  Investigator: 

Prof. David Jensen

Computer  Science  Department,  140  Governors  Drive 
University  of  Massachusetts,  Amherst,  MA  01003  USA 
Email:  jensen@cs.umass.edu 
Phone:  (413)  545-9677,  FAX:  (413)  545-1249 
Administrative: 

Ms. Carol Sprague

Office  of  Grants  and  Contracts,  Research  Administration  Bldg. 

University  of  Massachusetts,  Amherst,  MA  01003  USA 

Email:  sprague@resgs.umass.edu 

Phone:  (413)  545-0698,  FAX:  (413)  545-1202 


DARPA  Program  Mgr:  Kendra  Moore 

1.2  Project  Description 

This  section  summarizes  the  corresponding  text  as  presented  in  the  proposal. 

1.2.1 Research Objectives

1.2.1.1 Problem Description

A  persistent  goal  of  research  in  artificial  intelligence  has  been  to  enable  learning  and  reasoning  with 
probabilistic  models  in  complex  domains.  Much  of  this  work  has  been  directed  toward  systems  that 
complement,  rather  than  replace,  human  abilities  and  knowledge.  Models  that  fuse  engineered  knowledge 
(knowledge  from  human  sources)  with  learned  information  (information  gained  algorithmically)  can  take 
advantage of the strengths of both approaches, yielding more accurate predictions.

A particularly fruitful area for this research is improving our understanding of emergent behavior,
specifically,  how  connectivity  among  individual  units  of  a  system  affects  global  behavior.  The  Knowledge 
Discovery  Laboratory  (KDL)  seeks  to  apply  a  growing  understanding  of  emergent  behavior  to  the  design  of 
learning  and  reasoning  systems. 

1.2.1.2 Research Goals

The  main  goal  of  this  research  was  to  apply  network  analysis  to  understand  and  improve  the  performance  of 
relational  dependency  networks.  Relational  dependency  networks  (RDNs)  are  a  new  type  of  graphical  model 
that  exploit  emergent  behavior  to  improve  both  learning  and  inference.  RDNs  exhibit  emergent  behavior  by 
1. Characterize the structure and dynamics of relational dependency network (RDN) models.

2.  Reduce  error  of  inferred  probability  distributions. 

3.  Increase  efficiency  of  inference. 

4.  Produce  domain-specific  findings  in  at  least  two  areas. 

5.  Produce  open-source  implementations  of  algorithms. 

6.  Develop  new  evaluation  tools. 

1.2.1.3 Expected Impact

The  major  benefit  of  the  work  is  increasing  the  accuracy  and  efficiency  of  learned  RDNs.  Providing  robust 
and  scalable  methods  for  learning  and  inference  in  RDNs  would  enable  the  development  of  modular  and 
easily  understood  means  for  integrating  learned  and  engineered  knowledge  in  complex  domains.  Such 
techniques  would  have  immediate  applications  in  detecting  financial  fraud,  understanding  the  structure  of  the 
Web,  analyzing  citation  patterns  in  scientific  papers  and  patent  applications,  and  assessing  organizational 
structures  in  government  and  business. 

1.2.2  Technical  Approach 

1.2.2.1 Detailed Description of Technical Approach

Our  approach  centered  on  analyzing  and  understanding  the  characteristics  of  and  relationships  among  the 
three  different  types  of  networks  underlying  RDNs:  a  data  network  representing  the  interrelated  objects  that 
serves  as  the  input  data  for  learning  conditional  models  of  statistical  dependencies  among  attributes  of  those 
objects;  a  model  network  representing  statistical  dependencies  among  abstract  variables  that  is  constructed 
based  on  the  learned  conditional  models;  and  an  inference  network  representing  statistical  dependencies 
among  specific  random  variables.  Improving  our  understanding  of  these  underlying  networks  for  other 
models  has  resulted  in  improvements  to  those  models;  the  same  benefit  is  anticipated  for  learning  and 
reasoning  with  RDNs. 
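
The relationship among the three networks can be illustrated with a minimal sketch. It assumes a toy data network of papers and authors and a hand-written model network; the object names, attribute names, and the `roll_out` helper are hypothetical and are not taken from Proximity. The sketch instantiates the abstract dependencies over the linked objects to obtain dependencies among specific random variables.

```python
from collections import defaultdict

# Hypothetical toy data network: typed objects and undirected links.
objects = {"p1": "paper", "p2": "paper", "a1": "author"}
links = [("a1", "p1"), ("a1", "p2"), ("p1", "p2")]

# Hypothetical model network: each abstract variable (type, attribute)
# lists the abstract variables of related objects that it depends on.
model_network = {
    ("paper", "topic"): [("paper", "topic"), ("author", "area")],
    ("author", "area"): [("paper", "topic")],
}

def roll_out(objects, links, model_network):
    """Instantiate the model network over the data network.

    Returns a dict mapping each specific random variable (object id, attribute)
    to the list of specific variables it depends on.
    """
    neighbors = defaultdict(set)
    for u, v in links:
        neighbors[u].add(v)
        neighbors[v].add(u)

    inference_network = {}
    for obj, obj_type in objects.items():
        for (child_type, child_attr), parents in model_network.items():
            if child_type != obj_type:
                continue
            deps = []
            for nbr in sorted(neighbors[obj]):
                for parent_type, parent_attr in parents:
                    if objects[nbr] == parent_type:
                        deps.append((nbr, parent_attr))
            inference_network[(obj, child_attr)] = deps
    return inference_network

inference_network = roll_out(objects, links, model_network)
```

In this toy example the topic of `p1` depends on the topic of `p2` and vice versa, so the resulting inference network contains a cycle even though the model network is small.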

We  proposed  extending  our  existing  theoretical  analysis  of  how  the  properties  of  data  networks  affected 
algorithms  for  learning  conditional  models  from  relational  data.  Some  of  our  earliest  work  (e.g.,  Jensen  1999) 
showed  how  the  structure  of  a  data  network  affects  feature  selection  in  relational  learning,  and  more  recent 
work  on  autocorrelation  (Jensen  &  Neville,  2002)  and  degree  disparity  (Jensen,  Neville  &  Hay,  2003)  has 
further  added  to  our  understanding  of  these  effects.  We  have  also  shown  that  qualitative  models  of  these 
influences  allowed  us  to  design  and  implement  improved  algorithms  for  learning  conditional  models  (Neville, 
Jensen,  Friedland,  &  Hay  2003).  This  area  of  inquiry  promises  similar  benefits  for  new  models  such  as  RDNs. 

We  also  proposed  extending  our  relational  algorithms  to  exploit  ontological  information  during  learning  and 
to  provide  feedback  to  human  authors  of  ontologies.  Ontological  information  is  commonly  used  in 
aggregating  attribute  values,  a  common  technique  for  working  with  attributes  from  relational  instances  with 
variable  structure.  For  example,  a  conditional  relational  model  that  estimates  the  probability  a  given  paper  will 
be  published  might  aggregate  attributes  of  related  papers  such  as  author  count  or  page  length.  The  ontology 
laid  over  the  data  network  determines  whether  we  consider  all  related  papers  together  in  such  aggregations,  or 
whether  we  distinguish  between  related  papers  by  different  authors  and  related  papers  by  the  same  authors. 
Using  different  levels  of  an  ontology  establishes  different  equivalence  classes  and  reinterprets  the  data  for  a 
relational learner. Coarse-grained levels of an ontology reduce the statistical variance associated with
aggregated  features,  and  fine-grained  levels  of  an  ontology  reduce  the  statistical  bias  associated  with 
aggregating  potentially  dissimilar  objects.  Either  of  these  effects  could  allow  construction  of  far  more  accurate 
conditional  models,  and  a  learning  algorithm  might  search  to  find  the  right  balance  between  these  effects.  In 
addition,  such  use  of  an  ontology  could  indicate  what  levels  of  an  ontology  are  most  useful  for  learning, 
providing guidance to a human expert about where to elaborate an existing ontology.
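
The effect of ontology level on aggregation can be sketched as follows. The example assumes a two-level ontology in which the coarse level treats all related papers as one class and the fine level separates them by authorship; the data values, field names, and the `aggregate` helper are invented for illustration.

```python
from statistics import mean

# Hypothetical related papers for one target paper.
related = [
    {"same_author": True, "pages": 8},
    {"same_author": True, "pages": 10},
    {"same_author": False, "pages": 30},
    {"same_author": False, "pages": 2},
]

def aggregate(papers, level):
    """Average page length of related papers at a given ontology level."""
    if level == "coarse":
        # One equivalence class: all related papers are pooled.
        return {"related_paper": mean(p["pages"] for p in papers)}
    if level == "fine":
        # Two equivalence classes: split related papers by authorship.
        groups = {"same_author_paper": [], "other_author_paper": []}
        for p in papers:
            key = "same_author_paper" if p["same_author"] else "other_author_paper"
            groups[key].append(p["pages"])
        return {k: mean(v) for k, v in groups.items() if v}
    raise ValueError(f"unknown ontology level: {level}")

coarse = aggregate(related, "coarse")
fine = aggregate(related, "fine")
```

The coarse feature averages over four papers, so it is more stable but mixes dissimilar objects; each fine feature averages over only two papers, so it is more specific but noisier. A learner searching ontology space would choose between such feature sets.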

Because  RDNs  are  assembled  directly  from  conditional  models,  nearly  anything  that  affects  the  structure  of 
the  conditional  models  will  affect  the  structure  of  the  model  network  represented  by  an  RDN.  Our  previous 
research  on  RDNs  focused  on  learning  accurate  conditional  models;  for  this  project  we  proposed  a  more 
systematic  study  of  the  structure  of  RDNs  and  the  correlation  between  that  structure  and  the  characteristics  of 
conditional  learning  algorithms.  Specifically,  we  proposed  applying  standard  network  metrics  and  analysis 
algorithms  to  both  the  structure  of  RDNs  and  the  strength  of  the  probabilistic  dependencies  they  encode.  This 
analysis  allows  us  to  quantitatively  characterize  the  structure  of  RDNs  for  different  data  sets  and  to 
networks and data networks combine to produce structures in the inference network. We planned to use our
analysis of these properties to build theoretical models and to design new algorithms for learning conditional
models, RDN construction, and RDN inference.
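
As a rough illustration of applying standard network metrics to the structure of an RDN and to the strength of its dependencies, the sketch below summarizes a small weighted directed graph. The variable names, edge strengths, and the `summarize` helper are hypothetical; they stand in for whatever a learned RDN would provide.

```python
from collections import Counter

# Hypothetical RDN structure: directed edges (parent -> child) between
# abstract variables, each with an illustrative dependency strength.
edges = {
    ("paper.topic", "author.area"): 0.6,
    ("author.area", "paper.topic"): 0.4,
    ("paper.topic", "paper.topic"): 0.8,  # autocorrelation among related papers
    ("journal.field", "paper.topic"): 0.3,
}

def summarize(edges):
    """Compute a few standard structural metrics for a weighted digraph."""
    nodes = {n for edge in edges for n in edge}
    in_degree = Counter(child for _, child in edges)
    # Density counts self-loops as possible edges: |E| / |V|^2.
    density = len(edges) / len(nodes) ** 2
    # A reciprocal pair is a two-variable cyclic dependency.
    reciprocal = sum(1 for (u, v) in edges if u < v and (v, u) in edges)
    mean_strength = sum(edges.values()) / len(edges)
    return {
        "nodes": len(nodes),
        "in_degree": dict(in_degree),
        "density": density,
        "reciprocal_pairs": reciprocal,
        "mean_strength": mean_strength,
    }

metrics = summarize(edges)
```

Comparing such summaries across data sets is one simple way to characterize RDN structure quantitatively.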

1.2.2.2 Comparison with Current Technology

Relational dependency networks are relational models that encode joint probability distributions. These
characteristics  make  them  well-suited  for  integrating  learned  models  with  engineered  ontologies  and  inference 
rules.  Most  other  models  lack  one  or  more  of  these  key  characteristics. 

Propositional  learners  (the  largest  class  of  current  technologies)  such  as  decision  trees,  discriminant  analysis, 
linear  regression,  Bayesian  networks  and  conventional  dependency  networks  integrate  poorly  with  common 
forms of engineered knowledge. Propositional representations cannot express the rich relational dependencies
that are common in the first-order and higher-order representations used in the knowledge engineering
community, and they cannot make direct use of ontologies. Additionally, the independence assumptions
underlying propositional representations make collective inference impossible, limiting their accuracy in
realistic  domains. 
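
To make the contrast with independent prediction concrete, the sketch below shows collective inference in its simplest form: each unlabeled object is repeatedly re-estimated from the current estimates of its neighbors. This is a neighbor-averaging relaxation chosen for brevity, not the inference procedure used by RDNs, and the graph, the labels, and the `collective_inference` helper are hypothetical.

```python
# Hypothetical chain of linked objects: a - b - c - d.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
known = {"a": 1.0, "d": 0.0}  # observed class probabilities
prior = 0.5                   # what a model ignoring links would predict

def collective_inference(neighbors, known, prior, iterations=50):
    """Repeatedly re-estimate each unknown node from its neighbors."""
    probs = {n: known.get(n, prior) for n in neighbors}
    for _ in range(iterations):
        updated = dict(probs)
        for node, nbrs in neighbors.items():
            if node in known:
                continue  # observed labels stay fixed
            updated[node] = sum(probs[m] for m in nbrs) / len(nbrs)
        probs = updated
    return probs

probs = collective_inference(neighbors, known, prior)
```

A model that assumes independence would leave `b` and `c` at the prior, whereas the collective estimates move toward the labels of nearby objects (`b` toward `a`, `c` toward `d`).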

The  remaining  set  of  models  can  be  classified  as  relational  learning  or  inductive  logic  programming  (ILP) 
models.  Some  of  these  models,  such  as  traditional  approaches  to  ILP,  are  strictly  deterministic,  eliminating 
the strength of probabilistic reasoning found in many learned models. Others, such as relational probability
trees (Neville, Jensen, Friedland, & Hay 2003), 1BC2 (Lachiche & Flach 2003), and the relational neighbor
classifier (Macskassy & Provost 2003), can only encode conditional probability distributions rather than joint
distributions. Joint models compactly encode a larger number of probabilistic dependencies than conditional
models and they allow multiple forms of reasoning, including both prediction and anomaly detection. Of the
set of relational probabilistic models that encode joint distributions, two issues can severely limit their
usefulness: Models may not encode coherent joint distributions (e.g., Bayesian logic programs, Kersting & De
Raedt 2000), or they may be directed models, which cannot encode cyclic dependencies (Getoor et al. 2001).

Only two types of models meet all these criteria: relational Markov networks (Taskar et al. 2002) and RDNs.
Understandability is a key requirement for any model intended to combine engineered and learned knowledge.
Although both models can incorporate engineered knowledge, we have chosen to focus on RDNs because
they are simple to describe and easier for domain experts to understand than relational Markov networks.
Additionally,  RDNs  offer  independent  learning  of  components  whereas  relational  Markov  networks  are  not 
selective  and  require  hand-coded  features. 

1.2.3 Schedule and Milestones

1.2.3.1 Schedule Graphic

The following graphic indicates the project milestones as presented in the proposal. Note that Q4 FY06 and
Q1 FY07 are beyond the scope of the funded project.

1.2.3.2 Detailed Individual Task Descriptions

The following tasks are from the proposal, with overall categories: (1) Data sets; (2) Learning and ontologies;
(3) RDN construction; (4) RDN inference; and (5) Software. The percentage of each quarter's effort devoted to
a task is given in parentheses.

Q2  FY04 

1.1 Data sets: Identify learning tasks (10%)

1.2  Data  sets:  Develop  initial  data  sets  (50%) 

2.1  Learning  and  ontologies:  Develop  evaluation  methods  for  learning  conditional  models  (20%) 

2.2  Learning  and  ontologies:  Evaluate  current  learning  of  conditional  models  (20%) 

Q3  FY04 

1.2  Data  sets:  Develop  initial  data  sets  (continued)  (20%) 

2.2  Learning  and  ontologies:  Evaluate  current  learning  of  conditional  models  (continued)  (20%) 

2.3  Learning  and  ontologies:  Obtain  ontologies  for  evaluation  data  sets  (10%) 

3.1  RDN  construction:  Develop  evaluation  methodologies  for  RDNs  (20%) 

3.2  RDN  construction:  Evaluate  RDN  learning  with  current  conditional  models  (30%) 

Q4  FY04 

2.4  Learning  and  ontologies:  Extend  current  learning  algorithms  to  search  ontology  space  (50%) 

3.1  RDN  construction:  Develop  evaluation  methodologies  for  RDNs  (continued)  (10%) 

3.2  RDN  construction:  Evaluate  RDN  learning  with  current  conditional  models  (continued)  (10%) 

3.3  RDN  construction:  Create  engineered  conditional  models  (10%) 

4.1  RDN  inference:  Develop  evaluation  methodologies  for  inference  algorithms  (20%) 
4.2 RDN inference: Characterize properties of RDNs (20%)

5.1 Software: Release software (30%)

Q2 FY05

2.6 Learning and ontologies: Devise and implement methods to assist ontology (70%)

3.4 RDN construction: Evaluate RDN learning with fixed ontologies (30%)

Q3 FY05

2.7 Learning and ontologies: Develop methods for co-learning conditional models & ontologies (60%)

4.3 RDN inference: Develop theoretical analysis of inference complexity (40%)

Q4 FY05

2.7  Learning  and  ontologies:  Develop  methods  for  co-learning  conditional  models  &  ontologies  (continued) 
(60%) 

2.8 Learning and ontologies: Evaluate utility of co-learning conditional models & ontologies (40%)

Q1 FY06

1.4 Data sets: Revise data sets (20%)

2.7 Learning and ontologies: Develop methods for co-learning conditional models & ontologies (continued)
(20%)

2.9 Learning and ontologies: Improve efficiency of co-learning methods (30%)

3.5  RDN  construction:  Evaluate  RDN  learning  with  learned  ontologies  (10%) 

5.2  Software:  Release  software  (20%) 

Q2  FY06 

2.9 Learning and ontologies: Improve efficiency of co-learning methods (continued) (50%)

4.4  RDN  inference:  Devise  methods  to  decompose  rolled-out  model  network  (50%) 

Q3  FY06 

4.4  RDN  inference:  Devise  methods  to  decompose  rolled-out  model  network  (continued)  (60%) 

4.5  RDN  inference:  Evaluate  efficiency  and  accuracy  for  simplified  inference  (40%) 

Q4  FY06  (Project  ended  before  this  quarter) 

4.5  RDN  inference:  Evaluate  efficiency  and  accuracy  for  simplified  inference  (continued)  (50%) 

5.3  Software:  Release  software  (50%) 

Q1  FY07  (Project  ended  before  this  quarter) 

1.5  Data  sets:  Release  revised  data  sets  (50%) 

5.3  Software:  Release  software  (continued)  (50%) 

1.2.4 Deliverables Description

The proposed statement of work listed the following deliverables. A brief description of actual performance is
given in italics. For specific details, see section 3.1.5.

• Software releases — As specified in the schedule, we will release new versions of Proximity
approximately every 12 months. Additional releases are likely, as new incremental capabilities are
produced.

KDL produced six major releases (at approximately six-month intervals) of Proximity during the contract
period.

• Data sets — As specified in the schedule, we will release new benchmark data sets to aid comparative
studies and replication of our technical work.

KDL released four benchmark data sets during the contract period.
• Technical papers — Each new major technical finding and algorithm will be reported in technical papers
submitted to workshops, conferences, and scholarly journals. All such papers will be made available on
the PI's website, except where forbidden by copyright restrictions.

KDL  published  twelve  technical  papers  in  journals  and  workshop  and  conference  proceedings  during  the 
contract  period. 

Software  and  data  sets  were  released  open-source,  and  papers  and  other  written  products  are  available  under 
ordinary  copyright  agreements. 

Technology  Transition  and  Technology  Transfer,  Targets  and  Plans 

KDL’s  primary  venue  for  technology  transfer  is  through  releases  of  its  Proximity  software.  As  well  as 
providing  periodic  open-source  releases,  KDL  works  with  government  and  commercial  organizations  to  apply 
our  techniques  to  real  analysis  tasks  in  complex  domains. 


1.2.5 Quad Chart


Collective  Inference  with  Learned  and  Engineered  Knowledge 


IMPACT: 


•  Decrease  by  an  order  of  magnitude  the  time 
required  to  learn  an  accurate  relational  model  of  a 
domain  over  alternative  methods  (e.g.,  relational 
Markov  networks  (RMNs)). 

•  Learn  models  of  equivalent  or  higher  accuracy 
than  alternative  relational  models  (e.g.,  RMNs) 

•  Learn  models  with  50%  reduction  in  error  over 
models  using  conditional  inference  (e.g., 
Relational  Probability  Trees). 


NOVEL  IDEAS 

•  Relational  Dependency  Networks  (RDNs) 
express  statistical  dependence  among 
characteristics  of  related  entities  (e.g.,  authors 
and  the  papers  they  write). 

•  RDNs  can  be  learned  far  more  efficiently  than 
other  similarly  expressive  models. 

•  RDNs  allow  domain  experts  to  express  prior 
knowledge  in  natural  and  highly  expressive 
ways. 


SCHEDULE: 

(see  section  1.2.3) 
2 Funding Report

2.1 Funding Obligated this Period

2.2 Funding Obligated to Date

2.3 Incurred Expenses this Period

$1,000,000

2.4 Incurred Expenses to Date

$1,000,000 

2.5  Invoices  this  Period 

$0 

2.6  Invoices  to  Date 

$1,000,000 
KDL graduated two students (one Ph.D. and one M.S.), saw a post-doc move on to a
faculty position, and accepted two new graduate students.

• Amy McGovern, a postdoc in KDL since Fall 2002, joined the University of Oklahoma as an Assistant
Professor of Computer Science for the fall term, 2004.

• Jennifer Neville, KDL's first doctoral student, successfully defended her dissertation proposal on October
21, 2005. She completed her dissertation on August 1, 2006 and accepted a tenure-track faculty position
at Purdue University. Her dissertation, Statistical Models and Analysis Techniques for Learning in
Relational Data, was nominated for an ACM Doctoral Dissertation Award.

• Brian Gallagher completed his M.S. in Computer Science in January, 2006. He currently works on
developing algorithms and analysis techniques for complex networks at the Center for Applied Scientific
Computing at Lawrence Livermore National Laboratory.

• Two new graduate students, Brian Taylor and Marc Maier, joined KDL in September, 2005.

3.1.1 Progress Against Planned Objectives

KDL has met the proposed objectives (listed in section 1.2.1.2) as noted below:

Characterize the structure and dynamics of relational dependency network (RDN) models.

We conducted an extensive experimental evaluation of relational dependency networks. This analysis looks at
the relative strengths of RDN models, namely, the ability to represent cyclic dependencies, simple methods for


iirffemcsfc  rattertBam  iinjpcowaflaiHirfittiHndl  irraifbik . 
Increase efficiency of inference.


tepitfe  tbe  me  Caarib  appn®Qclhi£Sv.  RDN  kferem®  lae^Mifcs  tabd  to  <s©BLV&g§  qufcdkty 

to  memMe  irofcMltly  diii^triillMi)tliws;T  t tkm  we  ffoosdl  wmt  am  evaluation  aamdl  off  fee 

in®  wbkfr  <rolltetiv©  inrnformce  poffoiims  effectively.. 


Produce domain-specific findings in at least two areas.

We applied our work on relational learning to several real-world challenge problems, including predicting
fraud among stock brokers, reasoning about robotic movements, and constructing peer-to-peer networks.

• The first project was conducted jointly with the National Association of Securities Dealers (NASD), who
provided extensive access to data and expertise of their analysts (Neville et al. 2005). We conducted an
extensive analysis of data from NASD about stock fraud, applying several different analysis techniques
and combining the results into a statistical model that predicts the probability of fraud for individual
brokers. In addition, we conducted an extensive evaluation of the results on new data, using four
person-weeks of time from NASD professional examiners to evaluate the utility of the results and comparing
them to current NASD screening rules. Model predictions were found to correlate highly with the
subjective evaluations of experienced NASD examiners. Furthermore, in all performance measures, our
models performed as well as or better than the handcrafted rules that are currently in use at NASD.

• The second project focused on applying relational dependency networks to predicting the outcomes of
movements of a robotic torso and was conducted jointly with the UMass Laboratory for Perceptual
Robotics (Hart, Grupen & Jensen 2005). The resulting predictions were consistent with the training data
and yielded a policy that allowed picking up two differently shaped objects correctly.

• The third project focused on construction of social networks to improve peer-to-peer networking and was
conducted jointly with the UMass Privacy, Internetworking, and Mobile Systems Laboratory (Fast,
Jensen & Levine 2005). We studied methods of constructing peer-to-peer networks based on the
preferences of users. The approach uses a model of user preference identified by latent-variable
clustering with hierarchical Dirichlet processes (HDPs) to identify users who are likely to trade files in
the future. Our simulations and empirical studies show that the clusters of songs created by HDPs
effectively model user behavior and can be used to create desirable network overlays that outperform
alternative approaches.

Produce  open-source  implementations  of  algorithms. 

KDL released several versions of Proximity, an open-source environment for relational knowledge discovery.
Proximity includes implementations of our models including the relational Bayesian classifier, relational
probability trees, and relational dependency networks. See section 3.1.5 for additional information on KDL’s
Proximity  releases. 

Develop  new  evaluation  tools. 

Because  both  the  learning  and  inference  processes  can  introduce  errors,  relational  learning  algorithms  pose 
new challenges for error analysis, necessitating a new framework for analyzing and decomposing these errors.
We developed a new bias/variance framework that decomposes loss into errors due to both the learning and
inference process. We evaluated performance of three relational models on synthetic data and used the
framework to understand the reasons for poor model performance. With this understanding, we proposed a
number of directions to explore to improve model performance. The full details of this framework are
presented in "Bias/variance analysis for network data" (Neville & Jensen 2006).
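
For orientation, the conventional bias/variance decomposition of expected squared loss is sketched below. This is the standard textbook form, not the framework from the cited paper, which goes further by attributing error to the learning and inference processes separately. The notation is assumed here: t is the observed target, y is the model's prediction, expectations are taken over training sets and target noise, and t and y are treated as independent given the input.

```latex
\mathbb{E}\big[(t - y)^2\big]
  = \underbrace{\mathbb{E}\big[(t - \mathbb{E}[t])^2\big]}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[t] - \mathbb{E}[y]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(y - \mathbb{E}[y])^2\big]}_{\text{variance}}
```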

3.1.2 Technical Accomplishments

In addition to the accomplishments against planned objectives noted above, we produced the following
unplanned  accomplishments: 

Latent  Group  Models 

We invented and implemented a new type of probabilistic model for relational data, called a latent group
model. Latent group models represent autocorrelation dependencies by hypothesizing a hidden group entity.
This greatly simplifies both parameter learning and inference, resulting in a much more tractable probabilistic
collaborate in the future, whether someone will purchase a product, or whether

Temporal Extensions to Relational Models

We examined and tested methods of extending our existing models of relational data to handle
probabilistic dependencies that involve time. The result is a temporal extension to our relational probability
trees. Evaluation of our new algorithm for learning temporal features for relational probability trees revealed
a strong potential for prediction and a new type of error specific to temporal data.

3.1.3 Improvements to Prototype

Our software environment for relational knowledge discovery, Proximity, was completely rewritten to
convert to a new underlying database structure, MonetDB, a fast, open-source vertical database. The switch to
MonetDB resulted in orders of magnitude speed improvements for the kinds of operations needed by
relational knowledge discovery compared to implementations hosted on SQL databases.

We  have  also  added  numerous  new  features  and  capabilities  to  Proximity  including: 

•  an  implementation  of  relational  dependency  networks 

•  support  for  temporal  features  in  relational  probability  trees 

•  cleanup  of  model  code,  resulting  in  improved  ease  of  use  and  significant  increases  in  speed 

• extensions to Proximity’s implementation of the graphical query language QGraph

•  social  networking  analysis  tools 

•  synthetic  relational  data  generation  capabilities 

•  new  aggregation  functions 

•  additional  capabilities  for  and  finer  control  of  data  import  and  export  functionality 

Usability  improvements  included  a  completely  new  graphical  user  interface  that  eliminated  some  earlier 
cumbersome requirements and provided new tools for exploring, analyzing, modeling, and visualizing data.
We have also written a new graphical editor for creating QGraph queries, added an interactive interpreter for
executing short scripts, and created professionally written documentation for both Proximity and QGraph.

3.1.4 Significant Changes to Technical Approach

The  focus  of  the  project  shifted  somewhat  from  making  improvements  in  the  mid-range  technologies  for 
RDNs  to  the  “edges”  —  both  the  theoretical  (making  fundamental  theoretical  advances  in  the  understanding 
of  why  RDNs  perform  so  effectively  despite  their  very  simple  approaches  to  learning  and  inference)  and  the 
applied (evaluating the performance of RDNs in practical inference tasks). This shift resulted from our early
evaluation results that showed that RDNs were surprisingly effective in their current form.

3.1.5 Deliverables

We proposed three types of deliverables: software releases, benchmark data sets, and technical papers.

These deliverables have been met as described below. (See section 1.2.4 for a description of the proposed
deliverables.)

Datasets

KDL released the following data sets:

• HEP-Th — Data on papers in high-energy physics derived from the abstract and citation files provided
for the 2003 KDD Cup competition. The original datasets are from arXiv, an electronic archive of
research papers in physics and selected other sciences, and the SLAC SPIRES-HEP database, a
comprehensive catalog of high-energy particle physics literature compiled by the Stanford Linear
Accelerator Center. The dataset contains over 42,000 objects, over 500,000 links, 39 object attributes, and
15 link attributes.

• Can-o-sleep — Records of all the mp3 files shared by and transferred between users during an 81-day
period in the spring of 2003. The dataset contains over 500,000 objects, over 6 million links, 14 object
attributes,  and  6  link  attributes. 

•  DBLP  —  Information  on  computer  science  publications  listed  in  the  DBLP  Computer  Science 
Bibliography derived from a snapshot of the bibliography as of April 12, 2006. The dataset contains over
1,200,000 objects, over 2,480,000 links, 12 object attributes, and 6 link attributes.

•  Mobile  Social  Networks  —  Data  taken  from  a  series  of  experiments  in  wireless  mobile  connections 
undertaken  by  the  Privacy,  Internetworking,  Security,  and  Mobile  Systems  Laboratory  at  the  University 
of  Massachusetts  Amherst.  The  dataset  contains  27  objects,  over  180,000  links,  1  object  attribute,  and  2 
link  attributes. 

Technical  papers 

KDL published twelve technical papers that report the results of research performed as part of this project.
The  papers  are  listed  in  section  3.1.7. 

3.1.6 Technology Transition and Transfer

3.1.6.1 Technology Transition and Transfer Description

All transitioned technologies were implemented with versions of Proximity, our open-source environment for
relational knowledge discovery. Details of the capabilities of Proximity can be found in the Proximity
Tutorial, QGraph Guide, Cookbook, and Javadoc.

Technology Transition and Transfer List


Our Proximity software was downloaded more than 10,500 times during the contract period, though it is not
possible to track all the organizations that may be evaluating or using Proximity. That said, 46 of the
Rattigan, M., and D. Jensen (2005a). The case for anomalous link detection. Proceedings of the 4th
Multi-Relational Data Mining Workshop, 11th ACM SIGKDD International Conference on Knowledge Discovery
and  Data  Mining. 

Rattigan,  M.,  and  D.  Jensen  (2005b).  The  case  for  anomalous  link  discovery.  ACM  SIGKDD  Explorations , 
3.1.8 Meetings and Presentations

April  12-13,  2004  -  The  PI  attended  a  workshop  on  new  potential  DARPA  programs  in  machine  learning  in 
Philadelphia,  PA.  He  presented  KDL's  current  work  on  relational  learning  and  proposed  several  potential 
challenge  problems. 

June  10,  2004  -  The  PI  spoke  at  a  workshop  on  social  network  analysis  held  at  Yahoo  Labs  in  Pasadena,  CA. 
His  talk  presented  results  of  early  work  on  relational  dependency  networks. 

September  22-23,  2004  -  The  PI  presented  a  talk  on  relational  learning  methods  at  the  DHS  Data  Sciences 
Workshop  in  Washington,  DC. 

October  3,  2004  -  The  PI  presented  a  talk  at  the  “Turning  Information  into  Knowledge”  workshop  sponsored 
by  the  International  Atomic  Energy  Agency  in  New  York,  NY. 

January 31-February 4, 2005 - The PI and a graduate student (Jennifer Neville) presented two talks at the
Dagstuhl  seminar  on  Probabilistic,  Logical  and  Relational  Learning.  Neville's  talk  was  on  "Leveraging 
relational  autocorrelation  with  latent  group  models"  and  Jensen's  talk  examined  the  question  "Does  Accurate 
Statistical  Inference  Require  Joint  Models  of  Attributes  and  Relations?"  The  invitation-only  seminar  was  a 
major  week-long  international  meeting  of  researchers  in  relational  learning  held  in  southwestern  Germany. 

March 7-8, 2005 - The PI participated in a workshop convened by the U.S. Treasury and NSF entitled
"Resilient  Financial  Information  Systems"  in  Washington,  DC. 

May 31-June 1, 2005 - The PI gave an invited talk at the Pacific Northwest National Laboratories in Pasco,
Washington.  The  talk  discussed  learning  probabilistic  models  and  implications  for  data  visualization. 
Summer, 2005 — Presented papers at major conferences and workshops — Students gave six presentations of
accepted papers at major conferences and workshops (Fast, Jensen, and Levine 2005; Hart, Grupen, and
Jensen 2005; Neville et al. 2005; Neville and Jensen 2005b; Neville and Jensen 2005c; Rattigan and Jensen
2005a). 

July  9-10, 2005  -  Participated  in  AAAI  Doctoral  Consortium  —  KDL  students  Jennifer  Neville  and  Ozgur 
Simsek  presented  their  work  at  the  Tenth  AAAI/SIGART  Doctoral  Consortium  in  Pittsburgh,  Pennsylvania 
during the Twentieth National Conference on Artificial Intelligence. Their presentations were "Structure
Learning for Statistical Relational Models" and "Towards Competence in Autonomous Agents". The AAAI
and  ACM/SIGART  Doctoral  Consortium  provides  an  opportunity  for  a  group  of  Ph.D.  students  to  discuss  and 
explore  their  research  interests  and  career  objectives  with  a  panel  of  established  researchers  in  artificial 
intelligence. 

September  26-27, 2005  -  Attended  National  Academy  of  Sciences  meeting  —  The  PI  gave  an  invited  talk  at  a 
National  Academy  of  Sciences  meeting  on  Statistics  on  Networks  in  Washington,  DC. 

October 11-12, 2005 - The PI participated in a DARPA workshop on Behavioral Economics and Networks in
Philadelphia,  PA. 

October  18-19, 2005  -  The  PI  participated  in  a  DARPA  DSO  workshop  on  Virtual  Worlds  in  Washington, 
DC. 

October  20, 2005  -  The  PI  gave  an  invited  talk  at  the  Computer  Science  Department  at  Tufts  University. 

3.1.9 Issues or Concerns

(None) 

3.2  Project  Plans 

The  contract  is  completed. 